Modelling Non-verbal Sounds for Speech Recognition
Abstract
INTRODUCTION

Recent experiments performed by two groups of researchers at CMU have gathered data on subjects using speech recognizers in office-like environments (Rudnicky et al., 1989; Stern & Acero, 1989). These experiments are presented by the authors in these proceedings. Among other things, they show that non-verbal events (non-stationary noises) create serious problems for speech recognizers. These sounds are generated both by the speaker and by the environment. Examples of noise generated by the speaker are breath noises, lip smacks, paper rustles, filled pauses, coughs, and throat clearing. Environmental noise includes phone rings, door slams, other speakers in the background, and typing. We attempt to explicitly model the classes of noise represented by these events in the context of an HMM-based speech recognizer (Sphinx).

Subjects were recorded performing two tasks: a spreadsheet task and a census data (alphanumeric) task. A significant percentage (approximately 10% overall in each task) of the utterances contain phenomena of the type mentioned above. The utterances were transcribed using a set of noise words to represent non-signal events in the recording. Fourteen noise words were used: AH, BEEP, BREATH_NOISE, CLEAR_THROAT, COUGH, DOOR_SLAM, MOUTH_NOISE, MUMBLE, RUSTLE, PHONE_RING, SNIFF, SNEEZE, TAP and THUMP.

For each of these noise classes, a phone was added to the phone set and a word consisting of only that phone was added to the lexicon. The standard Sphinx training routines were then used to train context-dependent models for all phones except those representing noise. Context-independent models were used for the noise phones. The single-phone noise words provide no within-word context, and we did not use between-word models. For recognition, noise words are treated like Silence words: they are allowed to occur after any word, including themselves and other noise words.
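The lexicon extension and the recognition-time treatment of noise words described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Sphinx implementation: the function names, the `+WORD+` phone naming, the `SIL` label for the Silence word, and the dictionary-based grammar are all hypothetical. In particular, carrying the grammar context over from the last non-filler word is an assumption made here to keep the sketch concrete.

```python
# Illustrative sketch only: names and data layout are assumptions, not Sphinx's API.

NOISE_WORDS = [
    "AH", "BEEP", "BREATH_NOISE", "CLEAR_THROAT", "COUGH", "DOOR_SLAM",
    "MOUTH_NOISE", "MUMBLE", "RUSTLE", "PHONE_RING", "SNIFF", "SNEEZE",
    "TAP", "THUMP",
]
SILENCE = "SIL"  # hypothetical label for the Silence word
FILLERS = set(NOISE_WORDS) | {SILENCE}


def add_noise_words(phone_set, lexicon):
    """Return copies of the phone set and lexicon with one new phone and
    one single-phone word added per noise class."""
    phones = set(phone_set)
    lex = dict(lexicon)
    for word in NOISE_WORDS:
        phone = "+" + word + "+"  # hypothetical naming for the noise phone
        phones.add(phone)
        lex[word] = [phone]       # the word consists of only that phone
    return phones, lex


def is_filler(word):
    """True for Silence and for any of the noise words."""
    return word in FILLERS


def allowed_next(history, grammar):
    """Words allowed after `history` (a list of recognized words).

    `grammar` maps a word to its permitted successors; the key None stands
    for the start of the utterance. Fillers are allowed after any word,
    including themselves. Assumption: the grammar context is taken from
    the last non-filler word.
    """
    last_real = None
    for word in reversed(history):
        if not is_filler(word):
            last_real = word
            break
    return set(grammar.get(last_real, set())) | FILLERS
```

For example, with a toy grammar `{None: {"ENTER"}, "ENTER": {"FIVE"}}`, the history `["ENTER", "COUGH", "COUGH"]` still permits `FIVE` next, as well as any further noise word or Silence. An utterance made up only of fillers is also accepted by this scheme, which matches the requirement that noise-only or Silence-only utterances be allowed.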
We use the Sphinx recognizer with only minor modifications to implement transitions to noise words and to allow utterances that consist only of noise or Silence.

SPREADSHEET TASK

Alex Rudnicky and Michelle Sakamoto gathered a large corpus of examples of users performing a spreadsheet task using voice (Rudnicky et al., 1989). They used an operational speech recognition system, not a PNAMBIC paradigm. The subjects spoke in a spontaneous manner and were recorded using a Sennheiser close-talking microphone. The input to their system was continuous; recognition wasn't started by pressing a key just before

1 This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 5167, under contract number N00039-85-C-0163. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.
Related resources
Effects of modelling within- and between-frame temporal variations in power spectra on non-verbal sound recognition
Research on environmental sound recognition has not shown great development in comparison with that on speech and musical signals. One of the reasons is that the sound category of environmental sounds covers a broad range of acoustical natures. We classified them in order to explore suitable recognition techniques for each characteristic. We focus on impulsive sounds and their non-stationary fe...
Comparing Non-Verbal Vocalisations in Conversational Speech Corpora
Conversations do not only consist of spoken words but they also consist of non-verbal vocalisations. Since there is no standard to define and to classify (possible) non-speech sounds the annotations for these vocalisations differ very much for various corpora of conversational speech. There seems to be agreement in the six inspected corpora that hesitation sounds and feedback vocalisations are ...
Modelling of sound systems
This paper is an introduction to the computer modelling of systems of speech sounds. It focuses mainly on modelling that studies (the prerequisites of) evolution, so pure speech recognition is not treated. However, both cultural and biological evolution are considered as well as aspects of learning and social interactions that might be relevant to the evolution of speech. It will be argued that...
Cough detection in spoken dialogue system for home health care
This paper reports a cough detection technique in a spoken dialogue system for health care task. In conventional speech recognition systems, their targets are verbal sounds only and non-verbal sounds like coughs are processed as burden noises that cause mis-recognition. Coughing, however, is one of the most important barometers of daily health check, so the cough detection can be useful for che...
Articulation-based sound perception in verbal repetition: a functional NIRS study
Verbal repetition is a fundamental language capacity where listening and speaking are inextricably coupled with each other. We have recently reported that the left inferior frontal gyrus (IFG) harbors articulation-based codes, as evidenced by activation during repetition of meaningless speech sounds, i.e., pseudowords. In this study, we aimed at confirming this finding and further investigating...
Improved I-vector-based Speaker Recognition for Utterances with Speaker Generated Non-speech sounds
Conversational speech not only contains several variants of neutral speech but is also prominently interlaced with several speaker generated non-speech sounds such as laughter and breath. A robust speaker recognition system should be capable of recognizing a speaker irrespective of these variations in his speech. An understanding of whether the speaker-specific information represented by these ...